17 research outputs found

    VGGFace2: A dataset for recognising faces across pose and age

    Full text link
    In this paper, we introduce a new large-scale face dataset named VGGFace2. The dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession (e.g. actors, athletes, politicians). The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise. We describe how the dataset was collected, in particular the automated and manual filtering stages to ensure a high accuracy for the images of each identity. To assess face recognition performance using the new dataset, we train ResNet-50 (with and without Squeeze-and-Excitation blocks) Convolutional Neural Networks on VGGFace2, on MS-Celeb-1M, and on their union, and show that training on VGGFace2 leads to improved recognition performance over pose and age. Finally, using the models trained on these datasets, we demonstrate state-of-the-art performance on all the IARPA Janus face recognition benchmarks, e.g. IJB-A, IJB-B and IJB-C, exceeding the previous state-of-the-art by a large margin. Datasets and models are publicly available. Comment: This paper has been accepted by the IEEE Conference on Automatic Face and Gesture Recognition (F&G), 2018 (Oral).
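    The Squeeze-and-Excitation blocks mentioned above add a lightweight channel-gating step to the ResNet trunk. The sketch below shows one such block in PyTorch; the reduction ratio of 16 and the layer choices follow the generic SE-Net recipe and are assumptions, not the exact configuration of the released VGGFace2 models.

```python
# Minimal Squeeze-and-Excitation block sketch (PyTorch). The reduction ratio
# and layer choices are illustrative, not the released VGGFace2 configuration.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(                     # excitation: per-channel gating
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight feature maps channel-wise

# Example: gate a batch of ResNet-style feature maps.
features = torch.randn(4, 256, 14, 14)
print(SEBlock(256)(features).shape)                  # torch.Size([4, 256, 14, 14])
```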

    Template Adaptation for Face Verification and Identification

    Full text link
    Face recognition performance evaluation has traditionally focused on one-to-one verification, popularized by the Labeled Faces in the Wild dataset for imagery and the YouTubeFaces dataset for videos. In contrast, the newly released IJB-A face recognition dataset unifies evaluation of one-to-many face identification with one-to-one face verification over templates, or sets of imagery and videos for a subject. In this paper, we study the problem of template adaptation, a form of transfer learning to the set of media in a template. Extensive performance evaluations on IJB-A show a surprising result: perhaps the simplest method of template adaptation, combining deep convolutional network features with template-specific linear SVMs, outperforms the state-of-the-art by a wide margin. We study the effects of template size, negative set construction and classifier fusion on performance, then compare template adaptation to convolutional networks with metric learning, 2D and 3D alignment. Our unexpected conclusion is that these other methods, when combined with template adaptation, all achieve nearly the same top performance on IJB-A for template-based face verification and identification.
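    As a rough illustration of the template-adaptation idea, the sketch below fits one linear SVM per template, using the template's own deep features as positives against a shared negative set, and scores a verification trial symmetrically. The feature dimensionality, the random stand-in features, and the scoring rule are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of template adaptation: one linear SVM per template, trained
# on that template's deep features (positives) vs. a fixed external negative
# set. Feature extraction is mocked with random vectors.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
FEAT_DIM = 512                                   # assumed CNN embedding size

def adapt_template(template_feats, negative_feats, C=10.0):
    """Fit a template-specific linear SVM (positives vs. shared negatives)."""
    X = np.vstack([template_feats, negative_feats])
    y = np.concatenate([np.ones(len(template_feats)),
                        -np.ones(len(negative_feats))])
    return LinearSVC(C=C).fit(X, y)

def verify(clf_a, clf_b, feats_a, feats_b):
    """Symmetric verification score: each template's SVM scores the other's media."""
    return 0.5 * (clf_a.decision_function(feats_b).mean()
                  + clf_b.decision_function(feats_a).mean())

# Toy usage with random "CNN features" standing in for real embeddings.
negatives = rng.normal(size=(200, FEAT_DIM))
tmpl_a = rng.normal(loc=0.5, size=(8, FEAT_DIM))
tmpl_b = rng.normal(loc=0.5, size=(5, FEAT_DIM))
clf_a, clf_b = adapt_template(tmpl_a, negatives), adapt_template(tmpl_b, negatives)
print("match score:", verify(clf_a, clf_b, tmpl_a, tmpl_b))
```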

    Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild

    Get PDF
    In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract useful deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and test data, respectively.
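    A minimal sketch of the feature-level fusion step is given below: the outputs of the 2D-CNN, 3D-CNN, and audio branches are concatenated and fed to a small network that predicts one of the seven emotion classes. The branch dimensions and hidden width are illustrative assumptions, not the submitted model.

```python
# Hedged sketch of a fusion network that merges per-branch features by
# concatenation; feature sizes and the hidden width are assumptions.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, dims=(512, 512, 256), hidden=256, n_classes=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, feat_2d, feat_3d, feat_audio):
        # Feature-level fusion: simple concatenation of the three branch outputs.
        return self.mlp(torch.cat([feat_2d, feat_3d, feat_audio], dim=1))

# Toy usage with random branch features for a batch of 4 clips.
net = FusionNet()
logits = net(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 7])
```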

    AXES at TRECVID 2012: KIS, INS, and MED

    Get PDF
    The AXES project participated in the interactive instance search task (INS), the known-item search task (KIS), and the multimedia event detection task (MED) for TRECVid 2012. As in our TRECVid 2011 system, we used nearly identical search systems and user interfaces for both INS and KIS. Our interactive INS and KIS systems focused this year on using classifiers trained at query time with positive examples collected from external search engines. Participants in our KIS experiments were media professionals from the BBC; our INS experiments were carried out by students and researchers at Dublin City University. We performed comparatively well in both experiments. Our best KIS run found 13 of the 25 topics, and our best INS runs outperformed all other submitted runs in terms of P@100. For MED, the system presented was based on a minimal number of low-level descriptors, which we chose to be as large as computationally feasible. These descriptors are aggregated to produce high-dimensional video-level signatures, which are used to train a set of linear classifiers. Our MED system achieved the second-best score of all submitted runs in the main track, and the best score in the ad-hoc track, suggesting that a simple system based on state-of-the-art low-level descriptors can give relatively high performance. This paper describes in detail our KIS, INS, and MED systems and the results and findings of our experiments.
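    The query-time training strategy can be illustrated as follows: positives gathered from an external image search are paired with a fixed pool of background negatives, a linear classifier is trained on the spot, and the archive's video-level signatures are ranked by its scores (e.g. to compute P@100). The signature dimensionality, the random features, and the classifier choice below are assumptions, not the AXES system.

```python
# Hedged sketch of an on-the-fly, query-time classifier over video signatures.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
DIM = 1024                                         # assumed signature dimensionality

positives = rng.normal(loc=0.3, size=(30, DIM))    # "search-engine" example features
negatives = rng.normal(size=(500, DIM))            # generic background pool
archive = rng.normal(size=(10_000, DIM))           # video-level signatures to rank

X = np.vstack([positives, negatives])
y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
clf = LogisticRegression(max_iter=1000).fit(X, y)  # trained at query time

scores = clf.decision_function(archive)            # score the whole archive
top100 = np.argsort(-scores)[:100]                 # e.g. for P@100 evaluation
print("top-ranked video signatures:", top100[:5])
```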

    EgoBlur: Responsible Innovation in Aria

    Full text link
    Project Aria pushes the frontiers of Egocentric AI with large-scale real-world data collection, using purposely designed glasses and a privacy-first approach. To protect the privacy of bystanders recorded by the glasses, our research protocols are designed to ensure that recorded video is processed by an AI anonymization model that removes bystander faces and vehicle license plates. Detected face and license-plate regions are processed with a Gaussian blur so that these personally identifiable information (PII) regions are obscured. This process helps to ensure that anonymized versions of the video are retained for research purposes. In Project Aria, we have developed a state-of-the-art anonymization system, EgoBlur. In this paper, we present an extensive analysis of EgoBlur on challenging datasets, comparing its performance with other state-of-the-art systems from industry and academia, including an extensive Responsible AI analysis on the recently released Casual Conversations V2 dataset.
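    The anonymization step described above amounts to blurring detected PII regions. The sketch below applies OpenCV's Gaussian blur to a list of face or license-plate boxes; the detector is out of scope here, and the kernel size is an illustrative assumption, not EgoBlur's actual parameters.

```python
# Hedged sketch of PII anonymization by Gaussian-blurring detected boxes.
import cv2
import numpy as np

def blur_regions(frame: np.ndarray, boxes, kernel=(51, 51)) -> np.ndarray:
    """Return a copy of `frame` with each (x1, y1, x2, y2) box Gaussian-blurred."""
    out = frame.copy()
    for x1, y1, x2, y2 in boxes:
        roi = out[y1:y2, x1:x2]
        if roi.size:                               # skip empty/degenerate boxes
            out[y1:y2, x1:x2] = cv2.GaussianBlur(roi, kernel, 0)
    return out

# Toy usage on a synthetic frame with two hypothetical PII detections.
frame = np.full((480, 640, 3), 127, dtype=np.uint8)
anonymized = blur_regions(frame, [(100, 80, 220, 200), (400, 300, 560, 360)])
print(anonymized.shape)
```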

    The AXES research video search system

    Get PDF
    We will demonstrate a multimedia content information retrieval engine developed for audiovisual digital libraries, targeted at academic researchers and journalists. It is the second of three multimedia IR systems being developed by the AXES project. The system brings together traditional text IR and state-of-the-art content indexing and retrieval technologies to allow users to search and browse digital libraries in novel ways. Key features include: metadata and ASR search and filtering, on-the-fly visual concept classification (categories, faces, places, and logos), and similarity search (instances and faces).

    The AXES submissions at TrecVid 2013

    Get PDF
    The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event recounting task (MER), and the multimedia event detection task (MED) for TRECVid 2013. Our interactive INS system focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top-ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP. For SIN, MED and MER, we use systems based on state-of-the-art local low-level descriptors for motion, image, and sound, as well as high-level features that capture speech and text from the audio and visual streams, respectively. The low-level descriptors are aggregated by means of Fisher vectors into high-dimensional video-level signatures, while the high-level features are aggregated into bag-of-words histograms. Using these features we train linear classifiers, and use early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track, as well as in the ad-hoc track. This paper describes in detail our INS, MER, and MED systems and the results and findings of our experiments.
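    To make the aggregation step concrete, the sketch below encodes a set of local descriptors into a single video-level signature using a GMM, keeping only the first-order (mean-gradient) part of the Fisher vector and omitting the usual power and L2 normalizations. The descriptor dimension, the number of mixture components, and the random data are illustrative assumptions.

```python
# Hedged, simplified Fisher-vector-style encoding of local descriptors into
# one high-dimensional video-level signature (first-order statistics only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
K, D = 8, 64                                      # mixture components, descriptor dim

# Fit the GMM on a pool of local descriptors (normally: a large training set).
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(2000, D)))

def encode(descriptors: np.ndarray) -> np.ndarray:
    """Soft-assignment-weighted, variance-normalized residuals to each Gaussian
    mean, averaged over descriptors and flattened to a K*D-dimensional vector."""
    q = gmm.predict_proba(descriptors)                     # (N, K) soft assignments
    diff = descriptors[:, None, :] - gmm.means_[None]      # (N, K, D) residuals
    fv = (q[..., None] * diff / np.sqrt(gmm.covariances_)[None]).mean(axis=0)
    return fv.ravel()                                      # video-level signature

signature = encode(rng.normal(size=(300, D)))              # one video's descriptors
print(signature.shape)                                     # (512,) = K * D
```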

    Features and methods for improving large scale face recognition

    No full text
    This thesis investigates vector representations for face recognition, and uses these representations for a number of tasks in image and video datasets. First, we look at different representations for faces in images and videos. The objective is to learn compact yet effective representations for describing faces. We first investigate the use of "Fisher Vector" descriptors for this task. We show that these descriptors are well suited for face representation tasks. We also investigate various approaches to effectively reduce their dimension while further improving their performance. These "Fisher Vector" features are also amenable to extreme compression and work equally well when compressed by over 2000 times compared to their uncompressed counterparts. These features achieved state-of-the-art results on challenging public benchmarks until the re-introduction of Convolutional Neural Networks (CNNs) in the community. Second, we investigate the use of "Very Deep" architectures for face representation tasks. For training these networks, we collected one of the largest annotated public datasets of celebrity faces with minimal manual intervention. We bring out the specific details of these network architectures and their training objective functions that are essential to their performance, and achieve state-of-the-art results on challenging datasets. Having developed these representations, we propose a method for labeling faces in the challenging environment of broadcast videos using their associated textual data, such as subtitles and transcripts. We show that our CNN representation is well suited for this task. We also propose a scheme to automatically differentiate the primary cast of a TV serial or movie from the background characters. We modify existing methods of collecting supervision from textual data and show that careful alignment of video and textual data results in a significant improvement in the amount of training data collected automatically, which has a direct positive impact on the performance of labeling mechanisms. We provide extensive evaluations on different benchmark datasets, again achieving state-of-the-art results. Further, we show that both the shallow and the deep methods have excellent capabilities in switching modalities from photos to paintings and vice versa. We propose a system to retrieve paintings of similar-looking people given a picture, and investigate the use of facial attributes for this task. Finally, we show that an on-the-fly, real-time search system can be built to search through thousands of hours of video data starting from a text query. We propose product quantization schemes for making face representations memory efficient. We also present a demo system based on this design, built for the British Broadcasting Corporation (BBC) to search through their archive. All of these contributions have been designed with a keen eye on their application in the real world. As a result, most of the chapters have an associated code release and a working online demonstration.
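    The product-quantization idea mentioned above for memory-efficient face representations can be sketched as follows: an embedding is split into sub-blocks, each sub-block is replaced by the index of its nearest centroid in a small per-block codebook, and the vector is stored as a few bytes of codes. The embedding dimension, number of sub-blocks, and codebook size below are illustrative assumptions, not the thesis' configuration.

```python
# Hedged sketch of product quantization (PQ) for compressing face embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
DIM, M, K = 128, 8, 256                  # embedding dim, sub-blocks, centroids per block
SUB = DIM // M

# Train one small codebook per sub-block on a pool of example embeddings.
train = rng.normal(size=(2000, DIM)).astype(np.float32)
codebooks = [KMeans(n_clusters=K, n_init=1, random_state=0)
             .fit(train[:, m * SUB:(m + 1) * SUB]) for m in range(M)]

def pq_encode(x: np.ndarray) -> np.ndarray:
    """Compress one embedding to M uint8 codes (here 8 bytes instead of 512)."""
    return np.array([codebooks[m].predict(x[None, m * SUB:(m + 1) * SUB])[0]
                     for m in range(M)], dtype=np.uint8)

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Approximate reconstruction from the per-block centroid indices."""
    return np.concatenate([codebooks[m].cluster_centers_[codes[m]] for m in range(M)])

x = rng.normal(size=DIM).astype(np.float32)
codes = pq_encode(x)
print(codes.nbytes, "bytes; reconstruction error:",
      float(np.linalg.norm(x - pq_decode(codes))))
```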

    Temporal Multimodal Fusion for Video Emotion Classification in the Wild

    No full text
    This paper addresses the question of emotion classification. The task consists in predicting emotion labels (taken from a set of possible labels) that best describe the emotions contained in short video clips. Building on a standard framework, in which videos are described by audio and visual features that a supervised classifier uses to infer the labels, this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important as the size of the training set is small compared to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th at the 2017 Emotion in the Wild challenge with an accuracy of 58.8%.
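    Two of the fusion steps discussed above can be sketched as follows: per-frame class scores are first pooled over time within each modality, and the resulting clip-level scores are then combined with a weighted late fusion. The mean pooling and the weights are illustrative assumptions; the paper's hierarchical scheme additionally mixes features with scores.

```python
# Hedged sketch of temporal pooling followed by weighted score-level fusion.
import numpy as np

rng = np.random.default_rng(4)
N_CLASSES = 7                                    # EmotiW emotion labels

def temporal_pool(frame_scores: np.ndarray) -> np.ndarray:
    """Average per-frame class scores (T, C) into a single clip-level score (C,)."""
    return frame_scores.mean(axis=0)

def late_fuse(modality_scores, weights):
    """Weighted sum of clip-level scores from the different modalities."""
    w = np.asarray(weights, dtype=float)
    return sum(wi * s for wi, s in zip(w / w.sum(), modality_scores))

# Toy clip: 2D-CNN scores for 40 frames, 3D-CNN for 10 snippets, audio for 1 window.
scores_2d = temporal_pool(rng.normal(size=(40, N_CLASSES)))
scores_3d = temporal_pool(rng.normal(size=(10, N_CLASSES)))
scores_audio = temporal_pool(rng.normal(size=(1, N_CLASSES)))
fused = late_fuse([scores_2d, scores_3d, scores_audio], weights=[0.4, 0.4, 0.2])
print("predicted class:", int(np.argmax(fused)))
```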